A Brief Foray into Prosper Loan Data

Author: Nicholas Hunt-Walker

Before we do anything, let’s call the requisite libraries and go to the right working directory.

library(ggplot2) # for making any plots
library(gridExtra) #for making multi-panel plots
library(GGally)
library(memisc) #for data analysis stuff
setwd("/Users/Nick/Documents/udacity/projects/project4/")

And now we load the data. Note that the column descriptions can be found here

prosper <- read.csv("prosperLoanData.csv")

Some things to know about the data set before moving forward:

  • 113,937 observations, each with 81 (!!!!) variables
  • Last updated 03/11/2014

Formatting the Data and Adding Variables

There are five columns that contain date information. These are:

  • ListingCreationDate
  • ClosedDate
  • DateCreditPulled
  • FirstRecordedCreditLine
  • LoanOriginationDate

These may be useful for later, but as they stand right now they’re not usable. Let’s make them datetime objects.

Write a function that takes in a dataframe and a column of choice, and adds columns to the dataframe that splits the datetime object. I want the year, the month (maybe cyclical?), the day, as well as the date as a decimal of the year.

split_to_columns <- function(df, input_colnum, output_colname_assoc_arr) {
  dates <- strptime(df[, input_colnum], format="%Y-%m-%d %H:%M:%S")
  df[output_colname_assoc_arr["year"]] <- as.numeric(strftime(dates, format="%Y"))
  df[output_colname_assoc_arr["month"]] <- as.numeric(strftime(dates, format="%m"))
  df[output_colname_assoc_arr["day"]] <- as.numeric(strftime(dates, format="%d"))
  df[output_colname_assoc_arr["decimal_date"]] <- df[output_colname_assoc_arr["year"]] + (as.numeric(strftime(dates, format="%j"))/366)
  
  return(df)
}
#column 3
listing_creation_date_set <- c()
listing_creation_date_set["year"] <- "ListingCreationDate_year"
listing_creation_date_set["month"] <- "ListingCreationDate_month"
listing_creation_date_set["day"] <- "ListingCreationDate_day"
listing_creation_date_set["decimal_date"] <- "ListingCreationDate_decimal"

#column 7
closed_date_set <- c()
closed_date_set["year"] <- "ClosedDate_year"
closed_date_set["month"] <- "ClosedDate_month"
closed_date_set["day"] <- "ClosedDate_day"
closed_date_set["decimal_date"] <- "ClosedDate_decimal"

#column 25
credit_pulled_set <- c()
credit_pulled_set["year"] <- "DateCreditPulled_year"
credit_pulled_set["month"] <- "DateCreditPulled_month"
credit_pulled_set["day"] <- "DateCreditPulled_day"
credit_pulled_set["decimal_date"] <- "DateCreditPulled_decimal"

#column 28
first_recorded_credit_set <- c()
first_recorded_credit_set["year"] <- "FirstRecordedCreditLine_year"
first_recorded_credit_set["month"] <- "FirstRecordedCreditLine_month"
first_recorded_credit_set["day"] <- "FirstRecordedCreditLine_day"
first_recorded_credit_set["decimal_date"] <- "FirstRecordedCreditLine_decimal"

#column 65
loan_orig_date_set <- c()
loan_orig_date_set["year"] <- "LoanOriginationDate_year"
loan_orig_date_set["month"] <- "LoanOriginationDate_month"
loan_orig_date_set["day"] <- "LoanOriginationDate_day"
loan_orig_date_set["decimal_date"] <- "LoanOriginationDate_decimal"

prosper <- split_to_columns(prosper, 3, listing_creation_date_set)
prosper <- split_to_columns(prosper, 7, closed_date_set)
prosper <- split_to_columns(prosper, 25, credit_pulled_set)
prosper <- split_to_columns(prosper, 28, first_recorded_credit_set)
prosper <- split_to_columns(prosper, 65, loan_orig_date_set)

Let’s make the credit score a little easier to work with by putting that average into our dataframe

prosper$CreditScoreMean = (prosper$CreditScoreRangeUpper + prosper$CreditScoreRangeLower)/2

First Plots: Histograms of potentially interesting things

Is the stated monthly income a continuous series of numbers or is it split into categories?

summary(prosper$StatedMonthlyIncome)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3200    4667    5608    6825 1750000

Looks to be continuous to me, between 0 and $1.75 million, but weighted heavily to < 10,000. This is an…odd distribution. I should be able to make a histogram of that too and see what the real distribution is.

This is the stated monthly income at the time of borrowing

What about the original loan amount?

Hmm…. Is there some correlation between loan amount and stated monthly income?

with(prosper, cor.test(LoanOriginalAmount, StatedMonthlyIncome))
## 
##  Pearson's product-moment correlation
## 
## data:  LoanOriginalAmount and StatedMonthlyIncome
## t = 69.353, df = 113940, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1956816 0.2068243
## sample estimates:
##       cor 
## 0.2012595

Correlation coefficient = 0.2012595. Quite lower than what I was expecting. Let’s see what the scatterplot says

I wonder if there’s any visible difference when coloring on Borrower APR instead of just term.

Somewhat. Smaller loans for those employed part time. A ton of borrowers are employed or self-employed though. Let’s see the employment statuses of borrowers more explicitly.

Let’s see them in their own element

This could be an interesting thing to look at. Is there a split on Term?

Oooh this is money right here. No pun intended.

Here’s another question then, is there a significant difference between the ways different employment statuses borrow over time?

The employed and full-time borrowers wash out everyone else, so let’s see what it looks like without them.

Curious about the Loan Principal vs the Stated Monthly Income

Hmm… what about vs the mean credit score

Onto other things. There’s a variable for the percent of the listing funded. I suspect there may be a correlation between the PercentFunded and the amount asked for.

with(prosper, cor.test(LoanOriginalAmount, PercentFunded))
## 
##  Pearson's product-moment correlation
## 
## data:  LoanOriginalAmount and PercentFunded
## t = -3.4594, df = 113940, p-value = 0.0005416
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.016053721 -0.004441884
## sample estimates:
##         cor 
## -0.01024815

-0.01024815 is basically no correlation at all. Disappointment.

Let’s see what that PercentFunded parameter looks like

Oh that’s why. Because the vaaaaast majority of loans are fully funded. That was useless and enlightening all at once.

Let’s quantify just how many are fully funded

sum(with(prosper, (PercentFunded == 1.0) & !is.na(PercentFunded))) / sum(with(prosper, !is.na(PercentFunded))) * 100
## [1] 99.23642

99.236%! Yeah, we’re not going to get any meaningful correlations out of this one.

Another column is LP_NetPrincipalLoss. It’s the principal that remains uncollected after any recoveries. Curious to see if there’s any relationship between that and the loan amount. Of course, we start with the cor.test function, then plot some stuff up to see the distribution.

with(prosper, cor.test(LP_NetPrincipalLoss, LoanOriginalAmount))
## 
##  Pearson's product-moment correlation
## 
## data:  LP_NetPrincipalLoss and LoanOriginalAmount
## t = 43.359, df = 113940, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1216916 0.1331162
## sample estimates:
##       cor 
## 0.1274081

A correlation coefficient of 0.1274. Not the worst I’ve seen thus far, but still not great. Let’s see the plot

Ok now this is actually somewhat interesting. It seems that when there is a non-zero loss, there might actually be some correlation. Let’s test that again with cor.test

with(subset(prosper, LP_NetPrincipalLoss > 0), cor.test(LP_NetPrincipalLoss, LoanOriginalAmount))
## 
##  Pearson's product-moment correlation
## 
## data:  LP_NetPrincipalLoss and LoanOriginalAmount
## t = 251.12, df = 16644, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8862690 0.8926146
## sample estimates:
##       cor 
## 0.8894846

WHOA! 0.8895%! That is a very strong positive correlation between loss on the principal and the principal loan amount. Plot says…! (with a linear model overplotted, as well as a 1-1 line for reference)

Now that’s interesting. It seems that in many cases, especially for the smaller loans, when there’s a loss it’s on most of the loan if not all of the loan.

Let’s plot the median Net Principal Loss vs Loan Principal here. For clarity, we should probably bin up the loan principal.

There’s another dimension to this that I’m not looking at. Does Income Range have an effect on how much of the loan is lost?

No real effect here. Most of these losses are coming in for the $25 - 75k range. What about the rating of the loan? For reference, AA is the lowest risk loan, and HR is highest-risk.

Interestingly enough, the highest-risk loans aren’t dominating this landscape of loan principal loss. At the low end? Sure there’s a bunch. But most of the loans lost seem to be mid-range in risk. Another dimension we haven’t looked at yet is the actual loss vs. the estimated loss. Let’s check that out

There’s effectively no correlation. cor.test will confirm

with(subset(prosper, (LP_NetPrincipalLoss > 0)), cor.test(EstimatedLoss, LP_NetPrincipalLoss/LoanOriginalAmount))
## 
##  Pearson's product-moment correlation
## 
## data:  EstimatedLoss and LP_NetPrincipalLoss/LoanOriginalAmount
## t = 6.6659, df = 6173, p-value = 2.859e-11
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.05972077 0.10925006
## sample estimates:
##        cor 
## 0.08453763

0.08453763 is basically nothing when only considering loans that have had losses.

with(prosper, cor.test(EstimatedLoss, LP_NetPrincipalLoss/LoanOriginalAmount))
## 
##  Pearson's product-moment correlation
## 
## data:  EstimatedLoss and LP_NetPrincipalLoss/LoanOriginalAmount
## t = 62.647, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2038176 0.2166797
## sample estimates:
##       cor 
## 0.2102577

0.2102577 with all loans, though still not well correlated when the data is plotted.

Let’s take these questions in a different direction and think from the point of view of the vendor. Let’s first ask, how much have lenders made on loans given? Is there some correlation with loan term? How about over time?

What this tells me is that no matter what, lenders have almost always made money on a loan. It’s also effectively telling me that the lender yield rate doesn’t depend much on time. Additionally, while the vast majority of loans are for 36 months, terms don’t seem to matter as much in whether or not there will be appreciable yield (although short term loans have a yield cap below 0.3). Interesting how lending effectively came to a stop between the end of 2008 and mid 2009. The financial crisis really shut things down.

ggplot(data=prosper, aes(x=LoanOriginationDate_decimal, y=LenderYield * LoanOriginalAmount)) +
  geom_point(alpha=0.2) +
  xlab("Loan Origination Date") +
  ylab("Lender Yield ($)") +
  scale_x_continuous(breaks=seq(2005, 2014, 1))

The yield rate isn’t really enough to tell you how much was actually MADE on the loan. Here we see the lender yield amount plotted over time, telling a somewhat different story. Lender yield reaches a height just before lending stops. After, it takes 4 years to begin to approach that level again. Amazing.

I’m curious about these parameters involving “investments made by friends”. Let’s do histograms and time-series stuff.

Largely fruitless. Except for that last one. Let’s look at that in some greater detail.

Hmm. I have an idea of how this plays out with respect to loan amount but I want to see it for myself. First, I need to bin up some data.

prosper$LoanOriginalAmount.buckets <- cut(prosper$LoanOriginalAmount, 
                                          quantile(prosper$LoanOriginalAmount,
                                                   probs=c(0, .2, .4, .6, .8, 1.)),
                                          labels = c("$1,000-3,000", "$3,000-5,000",
                                                     "$5,000-8,500", "$8,500-15,000",
                                                     "$15,000-35,000"))

Now revisit those investors

I don’t really…see this one going anywhere. I’ve stepped through a number of interesting variables though.

Another thing, let’s look at LenderYield with respect to the borrower rate. A simple relationship.

And it comes out as expected. The higher the borrower rate, the higher the lender yield.

What if we split this relationship up on different variables?

What else influences lender yield rate?

Hmm, now that’s interesting. There’s a slightly higher median yield rate from borrowers that DON’T own homes. I wonder why that is? Does term length matter?

No matter the term length, those who don’t own homes provide a larger Lender Yield Rate! What about income range?

Now that is interesting. Lender yield rates tend to be higher for borrowers that are not employed. And specifically, it bucks the trend that non-homeowning borrowers provide a generally higher yield than homeowning borrowers. What about the total yield?

ggplot(data=subset(prosper, !(IncomeRange %in% c("$0", "Not displayed"))), aes(x=IsBorrowerHomeowner, y=LenderYield * LoanOriginalAmount)) + geom_boxplot() +
  xlab("Homeowner Status") +
  ylab("Lender Yield Amount ($)") +
  facet_wrap(~IncomeRange)

Now this is more of what I expected, though the differences between homeowners and non-owners wasn’t as drastic as I’d expected. Here we just see that the overall yield increases as the household income increases, despite that the yield rate decreases with ownership.

Let’s check on one more thing that just came to mind: Debt/Income ratio vs avg credit rating, colored by loan amount

Not entirely surprising but still interesting to see, the highest loans are given to people with better credit scores, while lower loans are available for most folks. What was more interesting to see was that the $5,000 - $15,000 range of loans was available all the way down to a credit score of 600. That’s a lot lower than I had expected, so I’ve learned something there. Also interesting to see the few cases where high debt/income ratios were still given loans. They were few and far between. Note that any debt/income ratio larger than 10 is shown as 10.01. So the structure for really high debt/income ratios is pretty much lost.

I’m curious about the breakdown by the original loan amount buckets, so let’s do that plot and then move on.

Apparently, keeping your debt-to-income ratio beneath 0.4 and a credit score > 620 is a great way to qualify for a loan of any type.

Predicting Goodness of Borrower

I want to try to assess the goodness of a borrower with respect to the requested loan amount. I’ll parameterize a good borrower by the net principal loss (LP_NetPrincipalLoss). Obviously, I’d want to only look at loans that have been closed, so only where ClosedDate_decimal is not NA. Here are some parameters that I think will be important to my model:

  • LoanOriginalAmount: The loan’s original amount
  • CreditScoreMean: The average between the lower and upper range of credit score
  • EmploymentStatusDuration: The length in months of the employment status of the borrower when the listing was created
  • IsBorrowerHomeowner: Boolean for whether or not the borrower is a homeowner
  • OpenCreditLines: Number of open credit lines at the time the credit profile was pulled
  • InquiriesLast6Months: Number of inquiries in the past six months at the time the credit profile was pulled
  • CurrentDelinquencies: Number of accounts delinquent at the time the credit profile was pulled
  • AmountDelinquent: Dollars delinquent at the time the credit profile was pulled
  • AvailableBankcardCredit: The total available credit via bank card at the time the credit profile was pulled
  • StatedMonthlyIncome: The monthly income the borrower stated at the time the listing was created
  • DebtToIncomeRatio: The debt to income ratio of the borrower
prosper_clean <- subset(prosper, (CreditScoreMean > 250) & (LP_NetPrincipalLoss > 0) &
                          !is.na(LP_NetPrincipalLoss) & !is.na(CreditScoreMean) &
                          !is.na(EmploymentStatusDuration) & !is.na(OpenCreditLines) &
                          !is.na(InquiriesLast6Months) & !is.na(AmountDelinquent) &
                          !is.na(AvailableBankcardCredit) & !is.na(DebtToIncomeRatio) &
                          !is.na(ClosedDate_decimal))

n_entries <- length(prosper_clean$Term)
all_indices <- seq(1:n_entries)

set.seed(42)
rand_samp <- sample(1:n_entries, as.integer(0.75 * n_entries)) #replace = False

prosper_train <- prosper_clean[rand_samp, ]
prosper_test <- prosper_clean[!(all_indices %in% rand_samp), ]

m1 <- lm(I(LP_NetPrincipalLoss) ~ I(CreditScoreMean), 
         data=prosper_train)
m2 <- update(m1, ~ . + LoanOriginalAmount)
m3 <- update(m2, ~ . + EmploymentStatusDuration)
m4 <- update(m3, ~ . + IsBorrowerHomeowner)
m5 <- update(m4, ~ . + OpenCreditLines)
m6 <- update(m5, ~ . + InquiriesLast6Months)
m7 <- update(m6, ~ . + CurrentDelinquencies)
m8 <- update(m7, ~ . + AmountDelinquent)
m9 <- update(m8, ~ . + AvailableBankcardCredit)
m10 <- update(m9, ~ . + StatedMonthlyIncome)
m11 <- update(m10, ~ . + DebtToIncomeRatio)

mtable(m11)
## 
## Calls:
## m11: lm(formula = I(LP_NetPrincipalLoss) ~ I(CreditScoreMean) + LoanOriginalAmount + 
##     EmploymentStatusDuration + IsBorrowerHomeowner + OpenCreditLines + 
##     InquiriesLast6Months + CurrentDelinquencies + AmountDelinquent + 
##     AvailableBankcardCredit + StatedMonthlyIncome + DebtToIncomeRatio, 
##     data = prosper_train)
## 
## ================================================
## (Intercept)                         155.790     
##                                    (283.516)    
## I(CreditScoreMean)                   -0.223     
##                                      (0.432)    
## LoanOriginalAmount                    0.733***  
##                                      (0.005)    
## EmploymentStatusDuration              0.417     
##                                      (0.257)    
## IsBorrowerHomeowner: True/False    -116.942*    
##                                     (48.498)    
## OpenCreditLines                      -6.423     
##                                      (4.843)    
## InquiriesLast6Months                 12.168*    
##                                      (6.074)    
## CurrentDelinquencies                  2.979     
##                                      (8.833)    
## AmountDelinquent                      0.005     
##                                      (0.003)    
## AvailableBankcardCredit               0.003     
##                                      (0.002)    
## StatedMonthlyIncome                  -0.003     
##                                      (0.007)    
## DebtToIncomeRatio                   -56.851*    
##                                     (23.220)    
## ------------------------------------------------
## R-squared                                  0.800
## adj. R-squared                             0.800
## sigma                                   2029.329
## F                                       3318.337
## p                                          0.000
## Log-likelihood                        -82595.505
## Deviance                         37603059672.875
## AIC                                   165217.011
## BIC                                   165309.580
## N                                       9143    
## ================================================

The \(R^2\) coefficient is fairly high (0.800). I’ve learned from https://cran.r-project.org/web/packages/texreg/vignettes/v55i08.pdf that mtable throws asterisks onto coefficients of significance, with the number of asterisks corresponding to the significance level of that coefficient.

Perhaps we can refine our model. Let’s first take out the following coefficients that appear to matter least: CreditScoreMean, EmploymentStatusDuration, AmountDelinquent, CurrentDelinquencies, OpenCreditLines, StatedMonthlyIncome, AvailableBankcardCredit.

m1 <- lm(I(LP_NetPrincipalLoss) ~ I(LoanOriginalAmount), 
         data=prosper_train)
m2 <- update(m1, ~ . + LoanOriginalAmount)
m3 <- update(m2, ~ . + IsBorrowerHomeowner)
m4 <- update(m3, ~ . + InquiriesLast6Months)
m5 <- update(m4, ~ . + DebtToIncomeRatio)

mtable(m5)
## 
## Calls:
## m5: lm(formula = I(LP_NetPrincipalLoss) ~ I(LoanOriginalAmount) + 
##     LoanOriginalAmount + IsBorrowerHomeowner + InquiriesLast6Months + 
##     DebtToIncomeRatio, data = prosper_train)
## 
## ================================================
## (Intercept)                          18.118     
##                                     (39.817)    
## I(LoanOriginalAmount)                 0.731***  
##                                      (0.004)    
## IsBorrowerHomeowner: True/False    -128.066**   
##                                     (43.727)    
## InquiriesLast6Months                 11.915*    
##                                      (5.740)    
## DebtToIncomeRatio                   -58.216**   
##                                     (22.527)    
## ------------------------------------------------
## R-squared                                  0.800
## adj. R-squared                             0.800
## sigma                                   2029.719
## F                                       9119.292
## p                                          0.000
## Log-likelihood                        -82600.765
## Deviance                         37646351022.611
## AIC                                   165213.531
## BIC                                   165256.255
## N                                       9143    
## ================================================

The \(R^2\) value hasn’t changed a bit. Let’s test this out on our test set, find the residuals, and plot up some stuff.

estimated_losses <- predict(m5, newdata = prosper_test, interval="prediction",
                           level = 0.95)
prosper_test$residual_loss <- estimated_losses[, 1] - prosper_test$LP_NetPrincipalLoss

While it looks like I’m over-estimating a ton of entries, I think I might actually be underestimating a bunch and just not realize it.

summary(prosper_test$residual_loss/prosper_test$LP_NetPrincipalLoss)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  -0.7891  -0.2102  -0.0868   1.6860   0.2503 942.0000

On average I’m overestimating loss from a given borrower, though my median is saying I’m pretty close with my estimations. That mean is definitely affected by the few outliers, as it exists well outside of my 3rd quartile. Let’s histogram it up.

Last bits for this model. I want to check that these are consistent regardless of my random seed, so let’s do it again!

set.seed(79)
rand_samp <- sample(1:n_entries, as.integer(0.75 * n_entries)) #replace = False

prosper_train <- prosper_clean[rand_samp, ]
prosper_test <- prosper_clean[!(all_indices %in% rand_samp), ]

m1 <- lm(I(LP_NetPrincipalLoss) ~ I(LoanOriginalAmount), 
         data=prosper_train)
m2 <- update(m1, ~ . + LoanOriginalAmount)
m3 <- update(m2, ~ . + IsBorrowerHomeowner)
m4 <- update(m3, ~ . + InquiriesLast6Months)
m5 <- update(m4, ~ . + DebtToIncomeRatio)

mtable(m5)
## 
## Calls:
## m5: lm(formula = I(LP_NetPrincipalLoss) ~ I(LoanOriginalAmount) + 
##     LoanOriginalAmount + IsBorrowerHomeowner + InquiriesLast6Months + 
##     DebtToIncomeRatio, data = prosper_train)
## 
## ================================================
## (Intercept)                          23.605     
##                                     (40.185)    
## I(LoanOriginalAmount)                 0.727***  
##                                      (0.004)    
## IsBorrowerHomeowner: True/False    -131.634**   
##                                     (44.084)    
## InquiriesLast6Months                 12.605*    
##                                      (5.762)    
## DebtToIncomeRatio                   -43.710*    
##                                     (22.263)    
## ------------------------------------------------
## R-squared                                  0.796
## adj. R-squared                             0.796
## sigma                                   2047.828
## F                                       8910.741
## p                                          0.000
## Log-likelihood                        -82681.979
## Deviance                         38321125224.589
## AIC                                   165375.959
## BIC                                   165418.683
## N                                       9143    
## ================================================

Not bad! Not bad at all! Now, how do I interpret this in the context of the original idea of a borrower’s goodness?

  • The higher the loan amount, the more the bank should expect to lose on it. The more a borrower asks for, the higher the risk.
  • Owning a home correlates with less money lost on a loan. One can find good buyers amongst homeowners.
  • The more inquiries that have been made into a borrower’s credit history, the higher the risk.
  • The higher the debt-to-income ratio, the less a bank should expect to lose. This is surprising to say the least. I’d expect higher debt-to-income ratios to correlate with more money lost per loan, but this imperfect linear model says otherwise.

Univariate Analysis

What is the structure of your dataset?

The prosper data set contains 113,937 loans, each with 81 characteristics. Most characteristics are continuous, though a few are qualitative attributes:

  • CreditGrade: AA (best), A, B, C, D, E (worst), HR (high risk), NC (no credit)
  • Term: 12, 36, 60
  • LoanStatus: Completed, Current, Past Due (1-15 days), Defaulted, Chargedoff, Past Due (16-30 days), Cancelled, Past Due (61-90 days), Past Due (31-60 days), Past Due (91-120 days), FinalPaymentInProgress, Past Due (>120 days)
  • ProsperRating (numeric): 0, 1, 2, 3, 4, 5, 6, 7
  • ProsperRating (alpha): AA, A, B, C, D, E, HR
  • ProsperScore: 1 (worst), 2, 3, 4, 5, 6, 7, 8, 9, 10 (best)
  • ListingCategory: 0 (N/A), 1 (Debt Consolidation), 2 (Home Improvement), 3 (Business), 4 (Personal Loan), 5 (Student Use), 6 (Auto), 7 (Other), 8 (Baby&Adoption), 9 (Boat), 10 (Cosmetic Procedure), 11 (Engagement Ring), 12 (Green Loans), 13 (Household Expenses), 14 (Large Purchases), 15 (Medical/Dental), 16 (Motorcycle), 17 (RV), 18 (Taxes), 19 (Vacation), 20 (Wedding Loans)
  • BorrowerState: Two letter abbreviations for every state
  • Occupation: 68 different occupations
  • EmploymentStatus: Self-employed, Employed, Not available, Full-time, Other, Not employed, Part-time, Retired
  • IsBorrowerHomeowner: Boolean
  • CurrentlyInGroup: Boolean
  • IncomeVerifiable: Boolean
Other Observations:
  • Most loans have 3-year terms (77%), while another sizeable fraction have 5-year terms (21.5%).
  • Most loans are currently in payment or completed. A significant portion however have been Charged-off (10.5%), and a smaller portion have gone into default (4.4%).
  • Loans aren’t given below $1,000, and the largest loan in the data set is $35,000. Most loans are smaller loans, shown by how the median ($6,500) is below the mean ($8,337).
  • The state with the largest number of loans (by far) is California, with 14,717 loans. The next largest is Texas, with 6,842 loans.
  • The overwhelming majority of loans (51.2%) were for debt consolidation
  • The average borrower APR is 0.218
  • Interestingly enough, there isn’t one credit grade dominating the borrower population. The largest fraction have an average grade of about C, but there’s a pretty even smattering from A through E. The biggest differences are that AA rated borrowers make up the smallest sample, while HR rated borrowers are around the same level as AA rated borrowers.
  • The median debt-to-income ratio for borrowers is 0.22. The debt-to-income ratio is notably skewed to not just below 1, but well below 0.5.
  • It’s noted above, but there are no loans in this set from between November 2008 and April 2009. Coincident with the financial crisis, and even in April 2009 there’s only a small spike in loans before going dead again for about 2 months.
  • The split between borrowers that are and aren’t homeowners is pretty even. Homeowners outnumber non-owners by about 1000, which isn’t much considering it’s a difference betwen 57,478 and 56,459.

What is/are the main feature(s) of interest in your dataset?

The main features of the data set are LenderYield, LoanOriginalAmount, and LP_NetPrincipalLoss. Above I used other characteristics of loans to try to predict LP_NetPrincipalLoss on loans. It wasn’t half bad!

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I’d initially thought that CreditScoreMean, EmploymentStatusDuration, IsBorrowerHomeowner, OpenCreditLines, InquiriesLast6Months, CurrentDelinquencies, AmountDelinquent, AvailableBankcardCredit, StatedMonthlyIncome, and DebtToIncomeRatio were all important when considering whether a loan would end up with a net loss, due to several facts and suggestions gleaned from investigations into personal credit. However, it ended up being that only the Loan Original Amount, the homeowner status, the last 6 months of credit inquiries, and the debt to income ratio were significant (as far as an inherently-flawed linear model was concerned).

Did you create any new variables from existing variables in the dataset?

Quite a few actually. Any variable with date information I broke apart into columns containing the day, the month, the year, and the date as a decimal of the year using strptime and strftime.

I created a variable called LoanOriginalAmount.buckets, containing 5 ranges of dollar amounts within which a loan could reside. It was helpful for looking at some distributions with respect to the loan amount range.

Lastly, I created a variable for the mean credit score, based on the upper and lower limits of credit scores for each borrower. Its purpose was so that I could pin down roughly which credit scores correspond to which loan/borrower characteristics, instead of having to interpret from a range.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The only unusual distributions that I took note of involved the debt-to-income ratios and the mean credit scores. It’s noted in the variable definitions file that the debt-to-income ratio metric caps at 10.01, regardless of how high the actual ratio might be. That produced slight anomalies in my distribution, and I ended up focusing mostly on debt-to-income ratios below 1.

The mean credit scores were odd in that the distribution went down to 9.5 even though credit scores shouldn’t really be below about 360. For my model analysis I ended up just using only credit scores that were above 250 and not “NA”.

Additionally, specifically for my linear model, I made sure every loan considered had a principal loss > 0, had some employment status duration (since it was a part of my initial variable set), open credit lines, inquiries in the last 6 months, some delinquent amount, some bankcard credit, some debt to income ratio, and some close date (to ensure that the loan was finished and not just late in payment). Basically I wanted to make sure that every variable considered for my model had a value that could be used to train the model.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

One interesting thing is that the estimated loss rate on loans had effectively little correlation with the actual loss rate. This could be a couple things. One may be that I’m just misinterpreting the Estimated Loss The variable is described as “Estimated loss is the estimated principal loss on charge-offs.” I calculated the actual loss rate as LP_NetPrincipalLoss / LoanOriginalAmount, or in other words, the principal remaining uncollected after any recoveries divided by the original principal of the loan. It may not be a full 1-to-1 correlation between what Prosper refers to as a loss rate, but there should be some correlation if their estimated loss rate is based on real data. The correlation coefficient from with(prosper, cor.test(LP_NetPrincipalLoss/LoanOriginalAmount, EstimatedLoss)) produces a coefficient of 0.210. Even smaller when only considering loans that posted an actual loss (0.084).

On the other hand, the amount lost correlated pretty well with the loan principal, which is to some degree expected. The higher the loan, the more that can be lost. However, it’s interesting to note that for loans of higher principals, it happens more often that if there is a loss, then larger fractions of the loan are likely to be lost. This is illustrated pretty well in the following diagram:

Interestingly enough, the “mean” line in red says that no matter what the loan principal, on average there’s not much difference in the fraction lost (considering the scatter). However the density of points suggest differently.

Lastly, Loan Principal vs Mean Credit Score. Now it seems as if any credit score can get a small loan. However, to get the highest loans you need at least a credit score of 600. Oddly enough, borrowers with the highest credit scores (\(\ge800\)) don’t borrow the most. They borrow at about the same level of those with low credit scores (\(\le660\)).

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

I found the relationship between the debt-to-income ratio and mean credit score to be somewhat interesting. Notably, the debt-to-income ratio seems to reach its highest in the middle of the mean Credit Score distribution. This is more of personal interest than intrinsic interest, as I’d expected the higher debt-to-income ratios to come with the lowest credit scores. Upon further thought though it makes sense. Borrowers with bad credit would only be permitted to borrow if their debt-to-income ratio is low. It’d be silly for a bank to lend money to someone shown to be bad with credit (low credit score) who also is already under a mountain of debt. On the other end, those with high credit would rarely have a high debt-to-income ratio.

The other relationship between non-main features I found interesting was mean credit score vs the listing category. For categories 4 and 5 (Personal Loan and Student Use), the credit score median and range are lower than all others. I suppose that necessity allows for those with lower credit to borrow. In contrast, a frivolous expense like a boat has a higher median credit score, as well as a much higher 75th percentile. All others are more or less in the same ranges for median and quartiles. What this means to me as a potential borrower is that if I want to get loan money from a bank, a credit score of about 710 is a good place to start.

One last one was between the total credit lines in the past 7 years vs the stated monthly income. It makes sense when I think about it, but apparently the more money you make, the more credit lines you’re likely to have opened. In fact, making more money seems to imply a lower limit to the amount of credit lines open!

What was the strongest relationship you found?

Objectively the strongest relationship that I saw for this data set was between the net principal loss and the loan principal amount. However, this isn’t really a fair assessment of strong relationships within this data set as these are not independent variables. Simply, the more money borrowed, the more money that can be lost when the loan isn’t paid. It would’ve been interesting to see if the loss RATE scaled with loan principal, but when I looked at that it showed an effectively flat distribution in loss rates across all loan principals.

The strongest relationship I found between seemingly unrelated components was the Total Credit Lines in the Past 7 Years vs Stated Monthly Income. The more money you make, the more credit you take.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Lender Yield is generally higher for higher-risk loans, depending of course on borrower APR. Flattened to 2D it’s easy to see that Lender Yield is higher for higher APR. That’s the point of having a higher APR.

A higher monthly income generally affords for higher loans, but being listed as employed (either as generally employed or full-time) shows a much wider field of potential in borrowed loans. Holding monthly income constant, more employment means you can borrow far larger loans, and self-employment isn’t quite as trusted.

Were there any interesting or surprising interactions between features?

Lender Yield vs homeowner status, split by income range. What surprised me were a couple things. The first was that unemployed homeowners provided a larger yield rate on their loans than unemployed people without homes. It’s surprising because it bucks the trend seen in every other income range. The other thing that was surprising was that lender yield rate was higher for those who made the LEAST money. I’d have expected that those who made more money would provide a higher yield rate, but perhaps it’s only that those who have little money to spare are given higher APRs as lenders don’t trust them as much as those with higher income.

Lender Yield vs borrower APR, split by credit grade. As mentioned above, lender yield is higher for higher-risk loans. However, the split on credit grade makes lenders almost seem predatory, making their larger yield (rates) on those with a shaky credit history. Coloring by the actual loan amount makes it seem a little less predatory, with higher-risk borrowers only acquiring generally smaller loans, however these are likely loans out of necessity. It makes me wonder what sorts of holes these borrowers are digging for themselves by taking out loans with shaky credit.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

I created a linear model to try and predict the amount that may be lost on a closed loan. The model was trained on 75% of randomly-selected data that posted a loss (without replacement) and tested on the remaining 25% posting a loss. I wanted to use this estimated loss as a proxy for borrower reliability.

One strength of my model is that, based on nothing else besides the magnitudes of coefficients output by my model, it can predict principal loss pretty well, accounting for 80% of the variation in the original loan amount. A large fraction of my residuals are near zero and a fair number overestimate the loss, which in a more realistic setting would help the lending party be more conservative with who they lend to. Another strength is that the model is fairly consistent, regardless of which random 75% of my data is selected to train it.

A major weakness however is that this model is not at all based on any solid financial theory. I can’t use this model to back up any ideas about borrower behavior, it’s just how the numbers work. Another weakness is that the model only considers loans that HAVE a loss, in addition to only considering loans that have what I considered valid measurements in the fields I was using as variables. In reality only represents about 10.7% of the total data set, so this isn’t really a robust model and a potential lender would be remiss in using it to assess the actual reliability of a potential borrower. I would call it the extremely-pessimistic estimation of loss on a given loan.

Final Plots and Summary

Plot One

Description:

The distribution of employment statuses across borrowers, split by the term length of the loans they borrow. Because the y-xis is logarithmic, it’s interesting to see just how many loans are long-term. It could be that fully-employed (Employed, Full-time, and Self-employed) folks take out short term loans just to get over a hump or make a new purchase that they know they can cover in the near term with their income.

Plot Two

Description:

Box plots of the lender yield vs homeowner status, split by income range. The boxes show the 3rd/1st quartile plus/minus 1.5x the interquartile range of rates in the vertical lines (top/bottom). Additionally, it shows the first, second (median, thick), and third quartiles in the horizontal lines. The income ranges are based on the income of the borrower at the moment the loan was issued.

What we’re seeing here is that for every income range except where income wasn’t reported (the “NA” set), the lender yield rate is higher for those that don’t own homes. What’s more, the less the borrower makes the higher the rate of return for the lender (caused no doubt by a higher borrower APR). Interesting however that the trend of lender yield rates is so similar across 5 very different income ranges.

Plot Three

Description:

Scatter plots of the debt-to-income ratio of the borrower vs mean credit score, split by the original amount of each loan. Overplotted in green is the average debt-to-income ratio for each bin of average credit score, with a bin width of 20. The axis for debt-to-income ratios is heavily truncated; the maximum is 10.01 even though some debt-to-income ratios are actually higher, and I’m focusing on the bulk of the distribution, where the ratio is less than about 1.2.

It’s interesting to note that while the biggest loans are largely only available to those with credit scores > 600, loans of all sizes are issued for the full range of credit scores. What’s universal between loan sets is that the debt-to-income ratios of borrowers is chiefly below 0.7, with the highest loans ($15,000+) falling mostly below 0.4. So, while credit scores may take on a wide range, one’s debt compared to their income will be a large determinant in how much money Prosper will allow one to borrow. It’s also interesting to see that having the highest credit score doesn’t necessarily translate to taking out tons of credit debt. We see this in te distribution of debt-to-income ratios, which stay low at the higher credit scores; they probably maintain their credit by keeping low debt to begin with. Contrast this with the lower credit scores, where the debt to income ratios at that end are likely low because that’s the only way that they could be approved for a loan.

Reflection

The Prosper Loans data set contains characteristics for almost 114,000 loans. The set begins in 2005 and follows those who have borrowed from Prosper through early 2014. Before looking at any of the data, I scrolled through the 81 different characteristics outlined in the variable definitions document. I noted that several variables were dated, so I created new variables for each splitting these dates in to years, months, days, and a decimal date. I wanted to be able to call on these if I wanted to look at a characteristic over time and allow myself to use an actual quantitative time parameter. I later used one of those decimal dates to see the frequency at which loans were issued between 2005 and 2014. I also created a variable for the midpoint between the minimum and maximum credit scores for each borrower. There was a consistent difference between the upper and lower limits of 19, but I still figured getting a number somewhere in between was better than using either of them. Another variable created a little later on in my analysis was LoanOriginalAmount.buckets, to give proper titles to different ranges of loan amounts. It was very useful for splitting loans into different value brackets for further analysis.

After creating the appropriate variables, I started stepping through exploring various characteristics one-by-one, many of them split by the term length of the loan. After getting a cursory feel for some individual characteristics of interest, I started looking at relationships between variables. I did this through a combination of multivariate plots (scatter plots mostly), and correlation tests with cor.test. After striking on a few relationships that I found interesting, I started thinking in terms of borrower behavior. Specifically, I wanted to know if the data can tell me whether how much a lender can expect to lose on a loan. I ended up with a linear model that was halfway decent at predicting loan loss, though it was only trained on a very small fraction of the overall set.

It was somewhat difficult to find meaningful characteristics to explore without dumping a ton of time into an N-by-N plot with ggpairs. Even using just 6 categories was a slog for my machine; 81 was asking for trouble I didn’t want. Some of the characeristics that I had hoped would yield interesting fruit (e.g. PercentFunded, LenderYield, and LoanOriginalAmount vs Listing Creation Date) ended up not doing so.

Even though there were some difficulties, there were notable successes as well! For one, time series data yielded some interesting results. Specifically, I found it interesting how lending reflects the economy over time, showing a distinct period where loaning was a no-go during the financial crisis last decade. I was also able to see how lending behavior changed depending on conditions of the borrower, like credit scores, homeowner status, income and employment. These factored heavily into my initial assumptions when I finally created my linear model.

What I’ve done here isn’t by any means a fully-comprehensive analysis. For one, I could look into a model predicting income ranges to expect from potential borrowers. I could see what the daily, monthly, and annual yield was from loans, split by term length. With a set so abundant in characteristics, analysis could go on forever. The data is also not as fleshed out as I’d like it to be in order to see more interesting relationships. For example, we have data on the state of origin of the borrower which could provide insight into cultural borrowing trends. However we could get a more complete picture if we actually had ethnicity information. Similarly, we have some not-so-clean employment information, but highest level of education could also tell a lot about borrowers, and provide valuable information for future lending practices with regard to where desired yield may lie. And of course we could do more with data that’s up to date with loans from 2015.